% New copy replacing galley 15 (C) Addison-Wesley 1979
Let us try to estimate how much time this method takes, if $m$-bit fixed-point
arithmetic is used in calculating the Fourier transforms. Exercise 10 shows
that all of the quantities $A↑{[j]}$ during all the passes of the transform
calculations will be less than 1 in magnitude because of the scaling (37), hence
it suffices to deal with $m$-bit fractions $(.a↓{-1}\ldotsm a↓{-m})↓2$ for the
real and imaginary parts of all the intermediate quantities. Simplifications are
possible because the inputs $u↓t$ and $v↓t$ are real-valued: only $K$ real
values instead of $2K$ need to be carried in each step (see exercise 4.6.4--14).
We will ignore such refinements in order to keep the discussion simple.
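
As an informal aid (not part of the text), the fixed-point convention can be
sketched in Python: each intermediate quantity keeps $m$ fractional bits in
both its real and imaginary parts, truncated toward zero so that magnitudes
never grow. The helper name trunc_m is mine; the sketches below reuse it.

    def trunc_m(z, m):
        # Truncate the real and imaginary parts of z toward zero, keeping
        # m fractional bits: the m-bit fractions of the text.  Python's
        # int() truncates toward zero, so |trunc_m(z, m)| <= |z|; this is
        # what makes the condition |omega^j| <= 1 easy to guarantee below.
        scale = 1 << m
        return complex(int(z.real * scale) / scale,
                       int(z.imag * scale) / scale)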

The first job is to compute $\omega$ and its powers. For simplicity we shall
make a table of the values $\omega↑0$, $\ldotss$, $\omega↑{K-1}$. Let $\omega↓r=
\exp(2πi/2↑r)$, so that $\omega↓1=-1$, $\omega↓2=i$, $\omega↓3=(1+i)/\sqrt2$,
$\ldotss$, $\omega↓k=\omega$. If $\omega↓r=x↓r+iy↓r$, we have
$$\omega↓{r+1}=\sqrt{1+x↓r\over2}+i\sqrt{1-x↓r\over2}.\eqno(39)$$
The calculation of $\omega↓1$, $\omega↓2$, $\ldotss$, $\omega↓k$ takes negligible
time compared with the other computations we need, so we can use any straightforward
algorithm for square roots. Once the $\omega↓r$ have been calculated we can
compute all of the powers $\omega↑j$ by starting with $\omega↑0=1$ and using the
following idea for $j>0$: If $j=2↑{k-r}\cdot q$ where $q$ is odd, and if $j↓0=
2↑{k-r}\cdot(q-1)$, we have
$$\omega↑j=\omega↑{j↓0}\cdot\omega↓r.\eqno(40)$$
This method of calculation keeps errors from propagating, since each $\omega↑j$
is a product of at most $k$ of the $\omega↓r$'s. The total time to calculate all
the $\omega↑j$ is $O(KM)$, where $M$ is the time to do an $m$-bit complex
multiplication; this is less time than the subsequent steps will require, so we
can ignore it.
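
For illustration only, the table of powers might be computed as follows in
Python, following (39) and (40); trunc_m is the hypothetical helper above.

    import math

    def omega_table(k, m):
        # w[r] = omega_r = exp(2 pi i / 2^r); omega_1 = -1, and (39)
        # gives each omega_{r+1} from the real part of omega_r.
        w = [None, complex(-1, 0)]
        for r in range(1, k):
            x = w[r].real
            w.append(complex(math.sqrt((1 + x) / 2), math.sqrt((1 - x) / 2)))
        # Build omega^j by (40): if j = 2^(k-r) q with q odd and
        # j0 = 2^(k-r) (q-1), then omega^j = omega^(j0) * omega_r.
        # Each omega^j is thus a product of at most k of the omega_r's,
        # so errors do not propagate.
        K = 1 << k
        table = [complex(1, 0)] * K
        for j in range(1, K):
            low = j & -j                     # 2^(k-r), lowest set bit of j
            r = k - low.bit_length() + 1
            table[j] = trunc_m(table[j - low] * w[r], m)
        return table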

Each of the three Fourier transformations comprises $k$ passes, each of which
involves $K$ operations of the form $a←b+\omega↑j c$, so the total time to calculate
the Fourier transforms is $O(kKM)=O(Mnk/l)$. Finally, the work involved in
computing the binary digits of $u\cdot v$ using (38) is $O\biglp K(k+l)\bigrp
=O(n+nk/l)$. Summing over all operations, we find that the total time to multiply
$n$-bit numbers $u$ and $v$ will be $O(n)+O(Mnk/l)$.
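
To make the pass structure concrete, here is a schematic transform in Python
(a sketch only: the bit-reversal permutation of the input is omitted, the
indexing is the generic one rather than necessarily that of the text, and it
is the scaling (37) of exercise 10 that keeps all intermediate magnitudes
below 1 on genuine inputs).

    def transform(a, table, m):
        # In-place transform of K = 2^k complex values (input assumed
        # already in bit-reversed order); table[j] = omega^j as above.
        # Each of the k passes performs K operations a <- b +- omega^j c,
        # rounded to m bits: the operation count used in the text.
        K = len(a)
        half, step = 1, K // 2
        while half < K:
            for base in range(0, K, 2 * half):
                for j in range(half):
                    b, c = a[base + j], a[base + j + half]
                    t = trunc_m(table[j * step] * c, m)
                    a[base + j] = trunc_m(b + t, m)
                    a[base + j + half] = trunc_m(b - t, m)
            half, step = half * 2, step // 2
        return a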

Now let's see how large the intermediate precision $m$ needs to be, so that we
know how large $M$ needs to be. For simplicity we shall be content with safe
estimates of the accuracy, instead of finding the best possible bounds. It
will be convenient to compute all the $\omega↑j$ so that our approximations
$(\omega↑j)↑\prime$ will satisfy $|(\omega↑j)↑\prime|≤1$; this condition is easy
to guarantee if we truncate towards zero instead of rounding. Let us assume that
our approximations $\omega↓r↑\prime$ to $\omega↓r$ can be written $\omega↓r↑\prime=
\omega↓r(1+\delta↓r)$ and that the multiplication (40) introduces an additional
error factor of $1+\delta↑{(j)}$, where the complex numbers $\delta↓r$ and $\delta
↑{(j)}$ satisfy 
$$|\delta↓r|≤\delta\quad\hbox{and}\quad|\delta↑{(j)}|≤\delta,\qquad
\delta=|2↑{-m}+2↑{-m}\,i|=2↑{1/2-m}.\eqno(41)$$
Then $|(\omega↑j)↑\prime-\omega↑j|≤(1+\delta)↑{2k-2}-1$, and this is less than
$(2k-1)\delta$ if we make the reasonable assumption that $m≥k≥5$. The error
involved in each Fourier transform step can now be estimated as follows: If
$$|b↑\prime-b|≤ε,\qquad |c↑\prime-c|≤ε,\qquad |(\omega↑j)↑\prime|≤1,\qquad
|(\omega↑j)↑\prime-\omega↑j|<(2k-1)\delta,$$
then when we replace the exact computation $a←b+\omega↑jc$ by $a↑\prime←
\hbox{round}\biglp b↑\prime+(\omega↑j)↑\prime c↑\prime\bigrp$ 
we have the absolute error
bound$$|a↑\prime-a|<\delta+ε+ε+(2k-1)\delta=2ε+2k\delta.$$
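Spelling this out, the rounding operation perturbs the result by at most
$\delta$, and $|c|≤1$ because of the scaling; hence
$$|a↑\prime-a|≤\delta+|b↑\prime-b|+|(\omega↑j)↑\prime|\,|c↑\prime-c|+
|(\omega↑j)↑\prime-\omega↑j|\,|c|<\delta+ε+ε+(2k-1)\delta.$$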
If the absolute error at the time of ``Pass 0'' is bounded by $ε↓0$, the absolute
error after ``Pass $j$'' will be bounded by $$2↑jε↓0+(2↑j-1)\cdot2k\delta.$$
Therefore the computed
values of $\A u↓s$ will satisfy $|(\A u↓s)↑\prime-\A u↓s|<(2↑k-1)\cdot2k\delta$;
a similar formula will hold for $(\A v↓s)↑\prime$; and we will have $$|(\A w↓s)↑
\prime-\A w↓s|<2(2↑k-1)\cdot2k\delta+\delta.$$ During the inverse transformation
there is an additional accumulation of errors, but the division by $K=2↑k$
ameliorates most of this, so the computed values $w↓r↑\prime$ will satisfy
$$|w↓r↑\prime-w↓r|<4k2↑k\delta.$$We need enough precision to make
$2↑{2k+2l}w↓r↑\prime$ round to the correct integer $W↓r$, hence we need
$$2↑{2k+2l+2+\lg k+k+1/2-m}≤{1\over2},$$ i.e.,
$m≥3k+2l+\lg k+7/2$. This will hold if we simply require that
$$k≥7\qquad\hbox{and}\qquad m≥4k+2l.\eqno(42)$$
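(To verify that (42) suffices: $m≥4k+2l$ implies $m≥3k+2l+\lg k+7/2$ provided
that $k≥\lg k+7/2$; the function $k-\lg k$ is increasing for $k≥2$, and
$7-\lg7=4.19\ldots>7/2$, so the condition holds for all $k≥7$.)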
Relations (35) and (42) can be used to determine parameters $k$, $l$, $m$ so
that multiplication takes $O(n)+O(Mnk/l)$ units of time, where $M$ is the
time to multiply $m$-bit fractions.

If we are using \MIX, for example, suppose we want to multiply binary numbers
having $n=2↑{13}=8192$ bits each. We can choose $k=11$, $l=8$, $m=60$, so that
the necessary $m$-bit operations are nothing more than double precision arithmetic.
The running time $M$ needed to do fixed-point $m$-bit complex multiplication
will therefore be comparatively small. With triple-precision operations we
can go up to $k=16$, $l=13$, $n≤13\cdot2↑{15}$, which takes us way beyond
\MIX's memory capacity.
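
As an informal illustration, the parameter search might be coded as follows
in Python; here I am assuming that relation (35) has the form $n≤2↑{k-1}l$,
which agrees with both examples above, and the word size $w$ is a parameter
($w=30$ bits for \MIX, so $m$ is rounded up to a whole number of words).

    def choose_params(n, w=30):
        # Find (k, l, m) with n <= 2^(k-1) * l   [assumed form of (35)]
        # and m >= 4k + 2l, k >= 7               [relation (42)],
        # rounding m up to a whole number of words and minimizing it.
        best = None
        for k in range(7, n.bit_length() + 1):
            l = -(-n // (1 << (k - 1)))    # least l with 2^(k-1) * l >= n
            m = -(-(4 * k + 2 * l) // w) * w
            if best is None or m < best[2]:
                best = (k, l, m)
        return best

For $n=8192$ this returns $(11,8,60)$, matching the choice above.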

Further study of the choice of $k$, $l$, and $m$ leads in fact to a rather
surprising conclusion:\xskip{\sl For all practical purposes we can assume that $M$ is
constant, and the Sch\"onhage-Strassen multiplication technique will have a
running time linearly proportional to $n$.}\xskip The reason is that we can choose
$k=l$ and $m=6k$; this choice of $k$ is always less than $\lg n$, so we will never
need to use more than sextuple precision unless $n$ is larger than the word size
of our computer.\xskip(In particular, $n$ would have to be larger than the
capacity of an index register used to address memory, so we probably couldn't
fit the numbers $u$ and $v$ in main memory.)

The practical problem of fast multiplication is therefore solved, except for
improvements in the constant factor. But our interest in multiplying large
numbers is partly theoretical, because it is interesting to explore the
ultimate limits of computational complexity. So let's forget practical
considerations and suppose that $n$ is extremely huge, perhaps much larger than
the number of atoms in the universe. We can let $m$ be approximately $6\lg n$, and
use the same algorithm recursively to do the $m$-bit multiplications. The
running time will satisfy $T(n)=O\biglp nT(\log n)\bigrp$; hence
$$T(n) ≤ C\,n(C\lg n)(C\lg\lg n)(C\lg\lg\lg n)\ldotss,$$
where the product continues until reaching a factor with $\lg\ldots\lg n≤1$.

Sch\"onhage and Strassen showed how to improve this theoretical upper bound to
$O(n\log n\log\log n)$ in their paper, by using {\sl integer} numbers
$\omega$ to carry out fast Fourier transforms on integers, modulo numbers of the
form $2↑e+1$. This upper bound applies to Turing machines, i.e., to computers with
bounded memory and a finite number of arbitrarily long tapes.

If we allow ourselves a more powerful computer, with random access to any number of
words of bounded size, Sch\"onhage has pointed out that the upper bound drops to
$O(n\log n)$. For we can choose $k=l$ and $m=6k$, and we have time to build a
complete multiplication table of all products $xy$ for
$0≤x,y<2↑{\lceil m/12\rceil}$.\xskip(The number of such products is $2↑k$ or
$2↑{k+1}$, and we can compute each table entry by addition from one of its
predecessors in $O(k)$ steps, hence $O(k2↑k)=O(n)$ steps will suffice for the
calculation.)\xskip In this case $M$ is the time needed to do 12-place arithmetic in
radix $2↑{\lceil m/12\rceil}$, and it follows that $M=O(k)=O(\log n)$ because
1-place multiplication can be done by table lookup.
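
A minimal sketch of the table construction (Python; the loop structure is my
rendering of the counting argument in the text, not code from it):

    def build_table(k):
        # Radix 2^(ceil(m/12)) with m = 6k, so the table has 2^k or
        # 2^(k+1) entries; each entry comes from its predecessor by one
        # addition of an O(k)-bit number, x*y = x*(y-1) + x, giving
        # O(k 2^k) = O(n) steps in all.
        m = 6 * k
        R = 1 << -(-m // 12)
        table = [[0] * R for _ in range(R)]
        for x in range(R):
            for y in range(1, R):
                table[x][y] = table[x][y - 1] + x
        return table

A 1-place product in radix $2↑{\lceil m/12\rceil}$ is then the single lookup
table[x][y], and a 12-place multiplication proceeds by the classical method
using such lookups.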

Sch\"onhage discovered in 1979 that a {\sl pointer machine} can carry out $n$-bit
multiplication in $O(n)$ steps; see exercise 12. Such devices (which are
also called ``storage modification machines'' and ``linking automata'') seem to
provide the best models of computation when $n→∞$, as discussed at the end of
Section 2.6. So we can conclude that multiplication in $O(n)$ steps is possible
for theoretical purposes as well as in practice.